Prune Diseased Branches to Get Healthy Trees! How to Find Erroneous Local Trees in a Treebank and Why It Matters

Author

  • Markus Dickinson
Abstract

Annotated corpora are essential for training and testing algorithms in natural language processing (NLP), but even so-called gold-standard corpora contain a significant number of annotation errors (cf. Dickinson 2005, and references therein). For part-of-speech annotation, these errors have been shown to be problematic for both training and evaluation of NLP technology (van Halteren 2000; Padro and Marquez 1998; van Halteren et al. 2001; Květoň and Oliva 2002). However, little work has been done on detecting errors in syntactic annotation (Ule and Simov 2004; Dickinson and Meurers 2003b, 2005), and the effect of the detected errors on the uses of such corpora has not been systematically explored. In this paper, we describe a new method for finding errors in treebanks and demonstrate the effect of such errors on NLP technology. Similar to the work in Dickinson and Meurers (2003b, 2005), where multiple occurrences of the same string in identical contexts are found with varying labels, the approach presented here is based on the detection of inconsistencies; but instead of focusing on the consistent assignment of a label to a string, we here investigate the consistency of labeling within local trees. In section 2, we describe the new method, and we show the results of applying it to the Wall Street Journal corpus in section 3. After discussing ways to automatically identify individual erroneous rules in section 4, we show in section 5 that eliminating the detected errors from the training data of a probabilistic context-free grammar (PCFG) parser improves its performance.
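As a rough illustration of what checking label consistency within local trees could look like, the sketch below assumes that a local tree is a mother label plus the sequence of its daughter labels, and flags any daughter sequence that occurs with more than one mother label. This reading, the function name `find_label_variation`, and the toy data are assumptions made for illustration only; they are not taken from the paper. Variation found this way marks candidates for inspection, not confirmed errors, since some variation may be genuine ambiguity.

```python
from collections import Counter, defaultdict


def find_label_variation(local_trees):
    """Return daughter sequences that occur with more than one mother label.

    `local_trees` is an iterable of (mother, daughters) pairs. This input
    format is a hypothetical simplification, not the paper's data format.
    """
    mothers_by_daughters = defaultdict(Counter)
    for mother, daughters in local_trees:
        # Key on the daughter sequence; count each mother label seen with it.
        mothers_by_daughters[tuple(daughters)][mother] += 1
    # Keep only daughter sequences whose mother label varies.
    return {
        daughters: dict(counts)
        for daughters, counts in mothers_by_daughters.items()
        if len(counts) > 1
    }


# Invented toy data: "DT NN" appears under two different mother labels.
toy_local_trees = [
    ("NP", ["DT", "NN"]),
    ("NP", ["DT", "NN"]),
    ("NX", ["DT", "NN"]),
    ("VP", ["VBD", "NP"]),
    ("PP", ["IN", "NP"]),
]

variation = find_label_variation(toy_local_trees)
for daughters, counts in variation.items():
    print(" ".join(daughters), counts)
```

Keeping the per-label counts leaves room for a later heuristic, for example treating a label that is rare for a given daughter sequence as the more suspicious one, although whether the paper uses such a heuristic is not stated in the abstract.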





Publication date: 2005